Skip to content

build: Cargo workspace + native-common extraction (1/6)#104

Merged
andygrove merged 8 commits into
apache:mainfrom
timsaucer:split/01-workspace-foundation
Jun 13, 2026
Merged

build: Cargo workspace + native-common extraction (1/6)#104
andygrove merged 8 commits into
apache:mainfrom
timsaucer:split/01-workspace-foundation

Conversation

@timsaucer

@timsaucer timsaucer commented Jun 12, 2026

Copy link
Copy Markdown
Member

This is part of a 6 PR stack. They build on each other, so review and merge in order — until the earlier parts merge, this PR's diff includes their changes too.

➤ 1. #104 — Cargo workspace + native-common extraction
2. #105datafusion-spark-bridge Rust SDK
3. #106 — Spark connector Java SPI
4. #107 — Spark DataSource V2 connector (Scala)
5. #108 — Bridge scaffold generator
6. #109 — End-to-end examples

Purpose

This work enables exporting a DataFusion TableProvider as a Spark DataSource.

This PR prepares the build for the follow on work. We currently have a single rust crate, but with this work we switch to a workspace containing two crates. The follow on work will add a third crate specific to the Spark work.

We split off common JNI plumbing into the new crate datafusion-jni-common. Both the existing datafusion-jni crate and the upcoming spark bridge will both depend on this new crate.

We move the rust target to rust-target/ instead of being within the native/target folder so that we have a single build directory for the workspace. We override the folder name to avoid maven deleting the folder unintentionally.

This is a pure refactor — no behavior change.

timsaucer and others added 4 commits June 12, 2026 13:58
…e-common

Move the standalone `native` crate into a root Cargo workspace and extract
shared JNI plumbing (error->exception mapping, Tokio runtime singleton,
StreamingReader) into a new `datafusion-jni-common` crate under `native-common/`.
`native/src/errors.rs` moves to `native-common/src/errors.rs`; the nine native
modules now import error/runtime helpers from `datafusion_jni_common`.

Build glue follows: single root `Cargo.lock`, `.cargo/config.toml` redirects
output to `rust-target/`, Makefile/CI/poms updated to build `--workspace` and
target `-p datafusion-jni`. Core javadoc build commands updated to match.

Pure refactor; no behavior change. First of a 6-PR stack splitting the Spark
DataSource V2 connector work.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on-jni-common

Add [workspace.package] (version, edition, license, repository) to the root
manifest and have both crates inherit it via `*.workspace = true`, so a single
bump re-versions the workspace in lock step.

Make datafusion-jni-common publishable: drop `publish = false` and add a
`description` (crates.io requires it). All its dependencies are registry
crates, so nothing blocks publish. datafusion-jni stays `publish = false`
since it is a JVM-loaded cdylib, not a crates.io library.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `**/META-INF/services/**` and `spark/scaffold/bridge-template/**` RAT
excludes guard files that do not exist until later splits in the stack. They
are dead config here. Re-added in the splits that introduce the files they
cover (META-INF/services in 04-spark-scala-connector, bridge-template in
05-bridge-scaffold).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `**/*.md` exclude dropped every markdown file repo-wide from license
checking, not just docs. It was redundant (docs are already covered by
`docs/**`) and overly broad. Removing it; RAT still reports 0 unapproved
because the remaining markdown carries valid headers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer force-pushed the split/01-workspace-foundation branch from 9f78869 to 9d2096c Compare June 12, 2026 12:12
timsaucer and others added 4 commits June 12, 2026 14:18
Add a crate README so the crates.io listing has front-page content, and wire
it in via `readme = "README.md"`. The ASF license header is an HTML comment
(matching dev/release/README.md) so RAT approves it while it stays invisible
in the rendered markdown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review feedback on the workspace-foundation PR:

- development.md: trim the repo-layout section to the crates this PR
  actually ships (native, native-common). It was forward-referencing
  spark/, spark/bridge, datafusion-spark-bridge, and examples/native --
  none of which exist until later PRs in the stack -- and called the
  member list "three" while listing four. Later PRs (apache#105/apache#106/apache#107/apache#109)
  carry notes to re-add their own slice when those dirs land.

- rat_exclude_files.txt: the Rust lockfile moved to the workspace root,
  so the stale native/Cargo.lock entry left the root Cargo.lock with no
  RAT exclude for the source-tarball check (check-rat-report.py). Point
  it at Cargo.lock.

- native-common: dedupe the panic-payload downcast -- StreamingReader::next
  now calls errors::panic_message instead of repeating the String/&str
  match inline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Cargo workspace conversion redirects build output to rust-target/
(via .cargo/config.toml), but the dev/release scripts still built from
native/ and read native/target/release/, which is no longer populated
even when cargo runs inside native/ (config is discovered up-tree).

- build-native-libs.sh / build-release.sh: build from the repo root with
  `-p datafusion-jni` and copy from rust-target/{release,<triple>/release}/.
- verify-release-candidate.sh: run `cargo fmt --all` workspace-wide so the
  new native-common crate is covered (matches CI lint.yml).
- updating-datafusion-version.md: list the workspace.dependencies entries
  that actually exist (datafusion, -proto, -spark, -substrait); drop the
  stray datafusion-ffi reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No release script or doc describes publishing the Rust crates to
crates.io, and the crate is an internal implementation detail of the
native crates (its README already says the API may change to track
their needs). Match `publish = false` on datafusion-jni so an
accidental `cargo publish` can't push it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer requested a review from coderfender June 12, 2026 12:51
@timsaucer

timsaucer commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@paleolimbot @coderfender Here is the first of six stacked PRs to add a spark data source that uses a datafusion table provider underneath. This one only moves things around in the repo to prepare for adding in the new crate. The full original PR #103 is way too large to review effectively so I've broken it into stages.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the Rust native build into a Cargo workspace and extracts shared JNI infrastructure into a new datafusion-jni-common crate, while updating Maven/scripts/CI to build from the workspace root and to use rust-target/ as the shared Cargo output directory.

Changes:

  • Introduce a workspace root Cargo.toml and new native-common (datafusion-jni-common) crate; migrate shared JNI/runtime/streaming/error plumbing into it.
  • Repoint build outputs to rust-target/ via .cargo/config.toml and update Maven/Makefile/CI/release scripts to use workspace builds and the new artifact paths.
  • Update docs and test comments to reflect new build commands and workspace dependency management.

Reviewed changes

Copilot reviewed 31 out of 33 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pom.xml Pin maven-compiler-plugin version; update RAT/license excludes for rust-target/ and root Cargo.lock.
Cargo.toml Add workspace root, members, shared package metadata, and centralized [workspace.dependencies].
Cargo.lock Move to workspace root; lockfile content updated for the new workspace resolution.
.cargo/config.toml Set Cargo target-dir = "rust-target" to keep Rust artifacts out of Maven-cleaned target/.
.gitignore Ignore rust-target/ in addition to target/.
Makefile Build/format/clean via workspace root; update runtime-metrics build command to use -p datafusion-jni.
native/Cargo.toml Convert to workspace-inherited metadata/deps; add dependency on datafusion-jni-common.
native/src/lib.rs Remove local errors module + inline StreamingReader; use datafusion-jni-common errors/runtime/streaming.
native/src/arrow.rs Switch JNI error handling imports to datafusion-jni-common.
native/src/avro.rs Switch JNI error handling imports to datafusion-jni-common.
native/src/cache_manager.rs Switch JniResult import to datafusion-jni-common.
native/src/csv.rs Switch JNI error handling imports to datafusion-jni-common.
native/src/json.rs Switch JNI error handling imports to datafusion-jni-common.
native/src/object_store.rs Switch JniResult import to datafusion-jni-common.
native/src/proto.rs Switch JNI error handling imports to datafusion-jni-common.
native/src/runtime_metrics.rs Switch JniResult import and update rebuild instructions for workspace package selection.
native/src/schema.rs Switch JniResult import to datafusion-jni-common.
native-common/Cargo.toml Add new datafusion-jni-common crate definition + avro feature forwarding.
native-common/src/lib.rs Add shared runtime singleton + StreamingReader implementation.
native-common/src/errors.rs Add shared error-to-exception mapping + exported panic_message.
native-common/README.md Document purpose/linking model/status of the new shared native crate.
core/pom.xml Update expected native library path to rust-target/... and update failure guidance text.
core/src/main/java/org/apache/datafusion/SessionContext.java Update user-facing build instructions for substrait/runtime-metrics to include -p datafusion-jni.
core/src/test/java/org/apache/datafusion/SessionContextRuntimeStatsTest.java Update build instructions for runtime-metrics feature to include -p datafusion-jni.
core/src/test/java/org/apache/datafusion/SessionContextSubstraitTest.java Update build instructions for substrait feature to include -p datafusion-jni.
.github/workflows/build.yml Update cargo cache paths/keys for workspace root lockfile + rust-target/.
.github/workflows/lint.yml Run formatting/clippy at workspace root; update cargo cache paths/keys.
docs/source/contributor-guide/development.md Update dev instructions and describe new workspace + rust-target/ behavior.
docs/source/contributor-guide/updating-datafusion-version.md Update version-bump procedure to reflect workspace dependency pinning.
dev/release/build-release.sh Update native build/copy paths and build commands for workspace + rust-target/.
dev/release/datafusion-java-rm/build-native-libs.sh Build datafusion-jni from repo root and reference rust-target/... outputs.
dev/release/rat_exclude_files.txt Exclude root Cargo.lock instead of native/Cargo.lock.
dev/release/verify-release-candidate.sh Run cargo fmt workspace-wide instead of only in native/.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@timsaucer timsaucer marked this pull request as ready for review June 12, 2026 13:37
@timsaucer

Copy link
Copy Markdown
Member Author

I considered renaming native to something like core and putting all the crates into a crates folder because the new native-common doesn't feel exactly right.

@paleolimbot

Copy link
Copy Markdown
Member

Very cool! I'm happy to review this work because it's very relevant to what we do (Spark via Sedona Spark and DataFusion via SedonaDB), with the limitation that I don't have any experience with JNI (just general arrow-over-c). I do have a few questions about the approach before I get started.

Notably, these are a few ways to avoid maintaining a pile of JNI (by exploiting other piles of JNI that other people have written already!)

One approach is to use ADBC as the FFI bridge. As a TableProvider author, a thin layer of code (a few hundred lines) can get you a cdylib with the ADBC entrypoint defined, and on the Java side there is already a driver manager that can import the cdylib and "run a SQL query", getting one FFI_ArrowArrayStream per partition back. You have to figure out how to squeeze your scan options into the ADBC interface but you're already serializing/deserializing most of it. It might look like statement.set_option("my_provider.options", "<bytes>"), statement.set_option("my_provider.projection", "1,2,3,4,"), statement.set_option("my_provider.filter", "<serialized_filter>"). and statement.execute_partitions() would just take an empty string. You could also do something fancier with Substrait (ADBC has a statement.set_query_substrait()). There are two pieces of prior art here to start from ( https://github.com/apache/sedona-db/tree/f9a9227cfa41cb71448d28c21330996fb657a3c1/rust/sedona-adbc/src , https://github.com/adbc-drivers/datafusion ).

Another approach is to extract the FFI of the table provider you actually need and add it to datafusion-ffi. You would still need some JNI on the Java side to import the entrypoints you've defined but because the surface is smaller than ADBC it's potentially less work. This has the advantage that you can shape it around the TableProvider more explicitly and iterate. It is also closer to the shape that Comet will need (cdylib + entrypoint serialized to protobuf), although there are probably a few shapes that would work there.

At a high level I think my complaint about the approach in this stack is that you are proposing to build a cdylib that exports JNI entrypoints, whereas a cleaner approach (in my opinion) is to build a cdylib that exports entrypoints that just use the Arrow C Data/Stream interface and C types. That also has broader applicability to non-Java (i.e., can live in datafusion proper and get eyes/reviews from a wider audience).

@andygrove andygrove left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @timsaucer

@andygrove andygrove merged commit ab57cb9 into apache:main Jun 13, 2026
6 checks passed
@andygrove

Copy link
Copy Markdown
Member

I missed the conversation on here before merging. Sorry about that.

@timsaucer

Copy link
Copy Markdown
Member Author

It’s my fault. I should have moved it to draft. I’ve been traveling and lost track

@paleolimbot

Copy link
Copy Markdown
Member

All good - we'll need dedicated Rust and Java components no matter how this goes. I am busy trying to get SedonaDB 0.4 out this week but can prototype the extract the FFI of the table provider you actually need as a PR into datafusion-ffi.

@timsaucer

Copy link
Copy Markdown
Member Author

@paleolimbot I started to churn on this and put up this PR: #110

I'm going to try digging in more tomorrow. I'll be out on vacation starting Thursday, so there's no rush on my side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants